1 Introduction

Intelligence definition
“The capacity of a system adapt itself with insufficient resources and knowledge” - Pei Waing.

Why R to study AI?
“R is one of the major languages for data science. It provides excellent visualisation features, which is essential to explore the data before submitting it to any automated learning, as well as assessing the results of the learning algorithm. Many R package for machine learning are available of the shelf and many modern methods in statistical learning are implemented in R as part of their development.”
From: https://lgatto.github.io/IntroMachineLearningWithR/an-introduction-to-machine-learning-with-r.html#why-r

2 Reading files & Regular Expressions

Regular expressions (regex) is a powerful ally when we’d like to clean/search for informations. So let’s take a breef overview in this subject because everyone will use it some time:

2.1 Library stringr

library('stringr') # string manipulation

From https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html:

There are four main families of functions in stringr:

  • Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors;
  • Whitespace tools to add, remove and manipulate whitespace;
  • Locale sensitive operations whose operations will vary from locale to locale;
  • Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are others tools.

2.2 Testing Regex Website

I’d like to recommend these two websites to construct and test the regex:

Regex info: https://docs.python.org/3/library/re.html
Test Online Regex: https://regex101.com/

For R, you can choose Flavor= Python (python and R regex are very similar because both regex structure come from pearl).

2.3 Example: readlines and regex

2.3.1 Reading txt files

Here we are going to read a txt file and then extract some informations about from the document:

file <- readLines("Ficha_Aluno.txt")

# Printing the file line by line
file
## [1] "Documento - Exemplo"    ""                      
## [3] ""                       "Turma: FIAP-04IA"      
## [5] "Nome: Raphael Prates"   "Trabalho Entregue: Sim"

2.3.2 Searching for expression

Here we are looking for two informations class name (Turma) and student’s name (Nome):

file <- readLines("Ficha_Aluno.txt")

# Finding the line that cointains Turma:
line_turma <- grep("Turma", file, value = TRUE)

# Extracting only the value
class <- str_extract(line_turma,'FIAP-[0]*[0-9][iI][aA]')

# Finding the line that cointains Nome:
line_nome <- grep("Nome", file, value = TRUE)

# Extracting only the value
name <- unlist(str_split(line_nome,pattern = ': '))[2]

# Printing the results

paste("I am", name,"from", class,". I'm studying Machine Learning in a MBA Course and I hope that you enjoy this portfolio!")
## [1] "I am Raphael Prates from FIAP-04IA . I'm studying Machine Learning in a MBA Course and I hope that you enjoy this portfolio!"

3 Kernel Analysis - “NYC Taxi: Fast & Curious”

Kernel name: “NYC Taxi EDA - Update: The fast & the curious”
Link: https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious

3.1 Introduction

Most of this work is a raw reproduction of the original one. The purpose of this reproduction is to test and learn the commands/steps used by the author to construct a good Exploratory Data Analysis. In the end, a topic about Kmeans was added complementing the original work.
When its possible we explored the reasons why the author choose the functions used and compared to others alternatives.

3.1.1 Libraries

library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('alluvial') # visualisation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('lubridate') # date and time
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps
library('xgboost') # modelling
library('caret') # modelling
library('profvis') # code execution time anaylsis

3.1.2 Load data

3.1.2.1 Original code

library("data.table")
library("tibble")

train <- as.tibble(fread('train.csv'))

3.1.2.2 Comparison with others alternatives

In R, read.csv is part of the regular functions and is used for load data.frame from a csv file. But when we’re dealing with a huge data.frame this function can take a long time to run.

print(paste("In this case the dataset is quite huge:",dim(train)[1], "rows and",
            dim(train)[2], "columns."))
## [1] "In this case the dataset is quite huge: 1458644 rows and 11 columns."

So in this part the author used a function called fread that performs much faster than read.csv (check the time of each function using profvis!!).
After that other function should be compared: load. This function is used to load variables that have been stored in a .RData file and runs very fast comparing with read.csv and fread.
When is a good ideia to use load? When it’s possible to use a background process to update the data.frame and save it in .RData file.
Let’s take a look at the three possibilities:

library("profvis")
library("data.table")
library("tibble")
library("readr")

profvis({
  # fread
  train <- as.tibble(fread("train.csv"))
  test <- as.tibble(fread("test.csv"))
  
  # read.csv
  train_readcsv <- read.csv("train.csv")
  
  # read_csv -> from "readr" package
  train_read_csv <- read_csv("train.csv")
  
  # loading RData
  save(train_readcsv, file = "train_data.RData")
  rm(train_readcsv)
  load(file = "train_data.RData")
})

3.1.2.3 Tibbles vs data frames

All the information bellow was “greped” from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
Tibbles
“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).”

Major points:

  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!);
  • It never adjusts the names of variables (i.e names with spaces will keep the whitespace. Data.frame replaces whitespace for ‘.’);
  • When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen;
  • Tibbles are quite strict about subsetting. [ ] always returns another tibble. Contrast this with a data frame: sometimes [ ] returns a data frame and sometimes it just returns a vector.

3.1.3 File structure and content

A brief overview of our data can summaries the descriptive statistics values of the dataset and detect abnormal items or outliers.

For the summaries

summary(train)
##       id              vendor_id     pickup_datetime    dropoff_datetime  
##  Length:1458644     Min.   :1.000   Length:1458644     Length:1458644    
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :1.535                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##  passenger_count pickup_longitude  pickup_latitude dropoff_longitude
##  Min.   :0.000   Min.   :-121.93   Min.   :34.36   Min.   :-121.93  
##  1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99  
##  Median :1.000   Median : -73.98   Median :40.75   Median : -73.98  
##  Mean   :1.665   Mean   : -73.97   Mean   :40.75   Mean   : -73.97  
##  3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96  
##  Max.   :9.000   Max.   : -61.34   Max.   :51.88   Max.   : -61.34  
##  dropoff_latitude store_and_fwd_flag trip_duration    
##  Min.   :32.18    Length:1458644     Min.   :      1  
##  1st Qu.:40.74    Class :character   1st Qu.:    397  
##  Median :40.75    Mode  :character   Median :    662  
##  Mean   :40.75                       Mean   :    959  
##  3rd Qu.:40.77                       3rd Qu.:   1075  
##  Max.   :43.92                       Max.   :3526282
summary(test)
##       id              vendor_id     pickup_datetime    passenger_count
##  Length:625134      Min.   :1.000   Length:625134      Min.   :0.000  
##  Class :character   1st Qu.:1.000   Class :character   1st Qu.:1.000  
##  Mode  :character   Median :2.000   Mode  :character   Median :1.000  
##                     Mean   :1.535                      Mean   :1.662  
##                     3rd Qu.:2.000                      3rd Qu.:2.000  
##                     Max.   :2.000                      Max.   :9.000  
##  pickup_longitude  pickup_latitude dropoff_longitude dropoff_latitude
##  Min.   :-121.93   Min.   :37.39   Min.   :-121.93   Min.   :36.60   
##  1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99   1st Qu.:40.74   
##  Median : -73.98   Median :40.75   Median : -73.98   Median :40.75   
##  Mean   : -73.97   Mean   :40.75   Mean   : -73.97   Mean   :40.75   
##  3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96   3rd Qu.:40.77   
##  Max.   : -69.25   Max.   :42.81   Max.   : -67.50   Max.   :48.86   
##  store_and_fwd_flag
##  Length:625134     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Data overview

library("dplyr")
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime   <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
glimpse(test)
## Observations: 625,134
## Variables: 9
## $ id                 <chr> "id3004672", "id3505355", "id1217141", "id2...
## $ vendor_id          <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1...
## $ pickup_datetime    <chr> "2016-06-30 23:59:58", "2016-06-30 23:59:53...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 1, 1...
## $ pickup_longitude   <dbl> -73.98813, -73.96420, -73.99744, -73.95607,...
## $ pickup_latitude    <dbl> 40.73203, 40.67999, 40.73758, 40.77190, 40....
## $ dropoff_longitude  <dbl> -73.99017, -73.95981, -73.98616, -73.98643,...
## $ dropoff_latitude   <dbl> 40.75668, 40.65540, 40.72952, 40.73047, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...

3.1.3.1 Comparison with others alternatives

Another popular way to make a data overview is using str. It is very similar to glimpse but str shows less data.

str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1458644 obs. of  11 variables:
##  $ id                : chr  "id2875421" "id2377394" "id3858529" "id3504673" ...
##  $ vendor_id         : int  2 1 2 2 2 2 1 2 1 2 ...
##  $ pickup_datetime   : chr  "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
##  $ dropoff_datetime  : chr  "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
##  $ passenger_count   : int  1 1 1 1 1 6 4 1 1 1 ...
##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
##  $ trip_duration     : int  455 663 2124 429 435 443 341 1551 255 1225 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.1.3.2 First observations

  • vendor_id only takes the values 1 or 2, presumably to differentiate two taxi companies
    We can easily check this doing:
levels(as.factor(train$vendor_id))
## [1] "1" "2"
  • pickup_datetime and (in the training set) dropoff_datetime are combinations of date and time that we will have to re-format into a more useful shape
  • passenger_count takes a median of 1 and a maximum of 9 in both data sets
  • The pickup/dropoff_longitute/latitute describes the geographical coordinates where the meter was activate/deactivated
  • store_and_fwd_flag is a flag that indicates whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”). Maybe there could be a correlation with certain geographical areas with bad reception?
  • trip_duration: our target feature in the training data is measured in seconds.

3.1.4 Missing values

To avoid an inappropriate analysis of the data, the missing values should be analysed to measure their impact in the whole dataset.
If the number of cases is less than 5% of the sample, then the researcher can drop them.
For more info about this subject: https://www.statisticssolutions.com/missing-values-in-data/
Luckly there is no missing values in data (easy mode):

sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0

3.1.5 Combining train and test

Here the author did an interesting move: he combined train and test data sets into a single one in order to avoid a closely approach that matches just one part of data.
CAUTION: we can only combine the two data sets for a better overview but for the creation of a machine learning model we should keep train and test separate.

# Mutate creates dset, dropff_datetime and trip_duration columns for test dataset
# For train dataset only dset is created by mutate
# bind_rows combines the data sets into one
combine <- bind_rows(train %>% mutate(dset = "train"), 
                     test %>% mutate(dset = "test",
                                     dropoff_datetime = NA,
                                     trip_duration = NA))
combine <- combine %>% mutate(dset = factor(dset))
glimpse(combine)
## Observations: 2,083,778
## Variables: 12
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime   <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dset               <fct> train, train, train, train, train, train, t...

3.1.6 Reformating features

For our following analysis, we will turn the data and time from characters into date objects. We also recode vendor_id as a factor. This makes it easier to visualise relationships that involve these features.

library('lubridate')
train <- train %>%
  mutate(pickup_datetime = ymd_hms(pickup_datetime),
         dropoff_datetime = ymd_hms(dropoff_datetime),
         vendor_id = factor(vendor_id),
         passenger_count = factor(passenger_count))

3.1.7 Consistency check

ASSUME NOTHING! It is worth checking whether the trip_durations are consistent with the intervals between the pickup_datetime and dropoff_datetime. Presumably the former were directly computed from the latter, but you never know. Below, the check variable shows “TRUE” if the two intervals are not consistent:

train %>%
  mutate(check = abs(int_length(interval(dropoff_datetime,pickup_datetime)) + trip_duration) > 0) %>%
  select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
  group_by(check) %>%
  count()
## # A tibble: 1 x 2
## # Groups:   check [1]
##   check       n
##   <lgl>   <int>
## 1 FALSE 1458644

And we find that everything fits perfectly.

3.2 Individual feature visualizations

“Visualizations of feature distributions and their relations are key to understanding a data set, and they often open up new lines of inquiry. I always recommend to examine the data from as many different perspectives as possible to notice even subtle trends and correlations.”

3.2.1 Leaflet package for interative maps

The author used a wonderful package for interative maps called leaflet. There a couple of models that you can try. We changed a little bit to try the leaflet-extras providers.

library(leaflet)
library(leaflet.extras)
set.seed(1234)
foo <- sample_n(train, 8e3)

  leaflet(data = foo) %>% addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "blue", fillOpacity = 0.3)

Using this visualization feature we notice the majority points of our trips (Manhattan and JFK airport).

3.2.2 Trip Duration

library(ggplot2)

train %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "red", bins = 150) +
  scale_x_log10() +
  scale_y_sqrt() +
  labs(title = "New York Taxi - EDA", x = "Trip Duration (s)", y = "Number of Events")

We find:

  • the majority of rides follow a rather smooth distribution that looks almost log-normal with a peak just short of 1000 seconds, i.e. about 17 minutes.
  • There are several suspiciously short rides with less than 10 seconds duration.
  • Additionally, there is a strange delta-shaped peak of trip_duration just before the 1e5 seconds mark and even a few way above it:
train %>%
  arrange(desc(trip_duration)) %>%
  select(trip_duration, pickup_datetime, dropoff_datetime, everything()) %>%
  head(10)
## # A tibble: 10 x 11
##    trip_duration pickup_datetime     dropoff_datetime    id      vendor_id
##            <int> <dttm>              <dttm>              <chr>   <fct>    
##  1       3526282 2016-02-13 22:46:52 2016-03-25 18:18:14 id0053~ 1        
##  2       2227612 2016-01-05 06:14:15 2016-01-31 01:01:07 id1325~ 1        
##  3       2049578 2016-02-13 22:38:00 2016-03-08 15:57:38 id0369~ 1        
##  4       1939736 2016-01-05 00:19:42 2016-01-27 11:08:38 id1864~ 1        
##  5         86392 2016-02-15 23:18:06 2016-02-16 23:17:58 id1942~ 2        
##  6         86391 2016-05-31 13:00:39 2016-06-01 13:00:30 id0593~ 2        
##  7         86390 2016-05-06 00:00:10 2016-05-07 00:00:00 id0953~ 2        
##  8         86387 2016-06-30 16:37:52 2016-07-01 16:37:39 id2837~ 2        
##  9         86385 2016-06-23 16:01:45 2016-06-24 16:01:30 id1358~ 2        
## 10         86379 2016-05-17 22:22:56 2016-05-18 22:22:35 id2589~ 2        
## # ... with 6 more variables: passenger_count <fct>,
## #   pickup_longitude <dbl>, pickup_latitude <dbl>,
## #   dropoff_longitude <dbl>, dropoff_latitude <dbl>,
## #   store_and_fwd_flag <chr>

Those records would correspond to 24-hour trips and beyond, with a maximum of almost 12 days. I know that rush hour can be bad, but those values are a little unbelievable.

3.2.3 Pickup and dropoff datetime

Over the year, the distributions of pickup_datetime and dropoff_datetime look like this: mark and even a few way above it:

p1 <- train %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120) +
  labs(x = "Pickup dates")

p2 <- train %>%
  ggplot(aes(dropoff_datetime)) +
  geom_histogram(fill = "blue", bins = 120) +
  labs(x = "Dropoff dates")

layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)

Fairly homogeneous, covering half a year between January and July 2016. There is an interesting drop around late January early February:

train %>%
  filter(pickup_datetime > ymd("2016-01-20") & pickup_datetime < ymd("2016-02-10")) %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120)

3.2.3.1 Raised questions from pickup_datetime data visualization

That’s winter in NYC, so maybe snow storms or other heavy weather? Events like this should be taken into account, maybe through some handy external data set?

3.2.4 Passager count, vendor_id, store_and_fwd_flag, Day of the week and Hour of the Day

In the plot above we can already see some daily and weekly modulations in the number of trips. Let’s investigate these variations together with the distributions of passenger_count and vendor_id by creating a multi-plot panel with different components:

p1 <- train %>%
  group_by(passenger_count) %>%
  count() %>%
  ggplot(aes(passenger_count, n, fill = passenger_count)) +
  geom_col() +
  scale_y_sqrt() +
  theme(legend.position = "none")

p2 <- train %>%
  ggplot(aes(vendor_id, fill = vendor_id)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  ggplot(aes(store_and_fwd_flag)) +
  geom_bar() +
  theme(legend.position = "none") +
  scale_y_log10()

p4 <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  count() %>%
  ggplot(aes(wday, n, colour = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week", y = "Total number of pickups") +
  theme(legend.position = "none")

p5 <- train %>%
  mutate(hpick = hour(pickup_datetime)) %>%
  group_by(hpick, vendor_id) %>%
  count() %>%
  ggplot(aes(hpick, n, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Hour of the day", y = "Total number of pickups") +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4,5,5),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)

We find:

  • There are a few trips with zero, or seven to nine passengers but they are a rare exception:
train %>%
  group_by(passenger_count) %>%
  count()
## # A tibble: 10 x 2
## # Groups:   passenger_count [10]
##    passenger_count       n
##    <fct>             <int>
##  1 0                    60
##  2 1               1033540
##  3 2                210318
##  4 3                 59896
##  5 4                 28404
##  6 5                 78088
##  7 6                 48333
##  8 7                     3
##  9 8                     1
## 10 9                     1
  • The vast majority of rides had only a single passenger, with two passengers being the (distant) second most popular option.
  • Towards larger passenger numbers we are seeing a smooth decline through 3 to 4, until the larger crowds (and larger cars) give us another peak at 5 to 6 passengers.
  • Vendor 2 has significantly more trips in this data set than vendor 1 (note the logarithmic y-axis). This is true for every day of the week.
  • We find an interesting pattern with Monday being the quietest day and Friday very busy. This is the same for the two different vendors, with vendor_id == 2 showing significantly higher trip numbers.
  • As one would intuitively expect, there is a strong dip during the early morning hours. There we also see not much difference between the two vendors. We find another dip around 4pm and then the numbers increase towards the evening.
  • The store_and_fwd_flag values, indicating whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”), show that there was almost no storing taking place (note again the logarithmic y-axis):
train %>%
  group_by(store_and_fwd_flag) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   store_and_fwd_flag [2]
##   store_and_fwd_flag       n
##   <chr>                <int>
## 1 N                  1450599
## 2 Y                     8045
y_count <- table(train$store_and_fwd_flag)['Y']/sum(table(train$store_and_fwd_flag))
paste0('Trip data stored in memory due to no connection represents ',round(y_count*100, digits = 2),'% of the values.')
## [1] "Trip data stored in memory due to no connection represents 0.55% of the values."

3.2.5 Time series graphics

The trip volume per hour of the day depends somewhat on the month and strongly on the day of the week:

p1 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         Month = factor(month(pickup_datetime, label = TRUE))) %>%
  group_by(hpick, Month) %>%
  count() %>%
  ggplot(aes(hpick, n, color = Month)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "count")

p2 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         wday = factor(wday(pickup_datetime, label = TRUE))) %>%
  group_by(hpick, wday) %>%
  count() %>%
  ggplot(aes(hpick, n, color = wday)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "count")

layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1, p2, layout=layout)

We find:

  • January and June have fewer trips, whereas March and April are busier months. This tendency is observed for both vendor_ids.
  • The weekend (Sat and Sun, plus Fri to an extend) have higher trip numbers during the early morning ours but lower ones in the morning between 5 and 10, which can most likely be attributed to the contrast between NYC business days and weekend night life. In addition, trip numbers drop on a Sunday evening/night.
    Finally, we will look at a simple overview visualization of the pickup/dropoff latitudes and longitudes:
p1 <- train %>%
  filter(pickup_longitude > -74.05 & pickup_longitude < -73.7) %>%
  ggplot(aes(pickup_longitude)) +
  geom_histogram(fill = "red", bins = 40)

p2 <- train %>%
  filter(dropoff_longitude > -74.05 & dropoff_longitude < -73.7) %>%
  ggplot(aes(dropoff_longitude)) +
  geom_histogram(fill = "blue", bins = 40)

p3 <- train %>%
  filter(pickup_latitude > 40.6 & pickup_latitude < 40.9) %>%
  ggplot(aes(pickup_latitude)) +
  geom_histogram(fill = "red", bins = 40)

p4 <- train %>%
  filter(dropoff_latitude > 40.6 & dropoff_latitude < 40.9) %>%
  ggplot(aes(dropoff_latitude)) +
  geom_histogram(fill = "blue", bins = 40)

layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)

Here we had constrain the range of latitude and longitude values, because there are a few cases which are way outside the NYC boundaries. The resulting distributions are consistent with the focus on Manhattan that we had already seen on the map. These are the most extreme values from the pickup_latitude feature:

train %>%
  arrange(pickup_latitude) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)
## # A tibble: 5 x 2
##   pickup_latitude pickup_longitude
##             <dbl>            <dbl>
## 1            34.4            -65.8
## 2            34.7            -75.4
## 3            35.1            -71.8
## 4            35.3            -72.1
## 5            36.0            -77.4
train %>%
  arrange(desc(pickup_latitude)) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)
## # A tibble: 5 x 2
##   pickup_latitude pickup_longitude
##             <dbl>            <dbl>
## 1            51.9            -72.8
## 2            44.4            -67.0
## 3            43.9            -71.9
## 4            43.5            -74.2
## 5            43.1            -72.6

We need to keep the existence of these (rather astonishing) values in mind so that they don’t bias our analysis.

3.2.6 Most common latitudes and longitudes for dropoff/pickup

Pickup point

train %>%
  select(pickup_longitude, pickup_latitude) %>%
  group_by(pickup_longitude, pickup_latitude) %>%
  count(sort = TRUE) %>%
  summary()
##  pickup_longitude  pickup_latitude       n         
##  Min.   :-121.93   Min.   :34.36   Min.   : 1.000  
##  1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: 1.000  
##  Median : -73.98   Median :40.75   Median : 1.000  
##  Mean   : -73.97   Mean   :40.75   Mean   : 1.056  
##  3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: 1.000  
##  Max.   : -61.34   Max.   :51.88   Max.   :39.000

Dropoff point

train %>%
  select(dropoff_longitude, dropoff_latitude) %>%
  group_by(dropoff_longitude, dropoff_latitude) %>%
  count(sort = TRUE) %>%
  summary()
##  dropoff_longitude dropoff_latitude       n         
##  Min.   :-121.93   Min.   :32.18    Min.   : 1.000  
##  1st Qu.: -73.99   1st Qu.:40.74    1st Qu.: 1.000  
##  Median : -73.98   Median :40.75    Median : 1.000  
##  Mean   : -73.97   Mean   :40.75    Mean   : 1.029  
##  3rd Qu.: -73.96   3rd Qu.:40.77    3rd Qu.: 1.000  
##  Max.   : -61.34   Max.   :43.92    Max.   :39.000

3.3 Feature relations

Now it’s time to examine how those features are related to each other and to our target trip duration.

3.3.1 Pickup date/time vs trip_duration

  • How does the variation in trip numbers throughout the day and the week affect the average trip duration?
  • Do quieter days and hours lead to faster trips?
p1 <- train %>%
  mutate(day_week = wday(pickup_datetime, label = TRUE)) %>%
  group_by(day_week, vendor_id) %>% 
  summarise(trip_duration_mean = median(trip_duration)/60) %>% 
  ggplot(aes(day_week, trip_duration_mean, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week",y = "Median trip duration [min]")

p2 <- train %>%
  mutate(hour_day = hour(pickup_datetime)) %>% 
  group_by(hour_day, vendor_id) %>% 
  summarise(trip_duration_mean = median(trip_duration)/60) %>% 
  ggplot(aes(hour_day, trip_duration_mean, color = vendor_id)) +
  geom_smooth(method = "loess", span = 1/2) +
  geom_point(size = 4) +
  labs(x = "Hour of the day",y = "Median trip duration [min]") +
  theme(legend.position = "none")

layout <- matrix(c(1,2), 2,1, byrow = FALSE)
multiplot(p1,p2, layout = layout)

We find:

  • There is indeed a similar pattern as for the business of the day of the week. Vendor 2, the one with the more frequent trips, also has consistently higher trip durations than vendor 1. It will be worth adding the vendor_id feature to a model to test its predictive importance.
  • Over the course of a typical day we find a peak in the early afternoon and dips around 5-6am and 8pm. The weekday and hour of a trip appear to be important features for predicting its duration and should be included in a successful model.

3.3.2 Passenger count and Vendor vs trip_duration

Are different numbers of passengers and/or the different vendors correlated with the duration of the trip?

train %>% 
  ggplot(aes(passenger_count, trip_duration, color = passenger_count)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none") +
  facet_wrap(~vendor_id) +
  labs(y = "Trip duration [s]", x = "Number of passengers")

We find:

  • Both vendors have short trips without any passagengers;
  • Vendor 1 has all of the trips beyond 24 hours, whereas vendor 2 has all of the (five) trips with more than six passengers and many more trips that approach the 24-hour limit.
  • Between 1 and 6 passengers the median trip durations are remarkably similar, in particular for vendor 2. There might be differences for vendor 1, but they are small (note the logarthmic y-axis):
train %>% 
  ggplot(aes(trip_duration, fill = vendor_id)) +
  geom_density(position = "stack") +
  scale_x_log10()

Comparing the densities of the trip_duration distribution for the two vendors we find the medians are very similar, whereas the means are likely skewed by vendor 2 containing most of the long-duration outliers:

train %>% 
  group_by(vendor_id) %>% 
  summarise(mean_duration = mean(trip_duration),
            median_duration = median(trip_duration))
## # A tibble: 2 x 3
##   vendor_id mean_duration median_duration
##   <fct>             <dbl>           <dbl>
## 1 1                  845.            658.
## 2 2                 1059.            666.

3.3.3 Store and Forward vs hour

The original analysis compared “Store and Forward vs trip_duration”, but I presume that the analysis for “Store and Forward” will better performed if we consider the location (longitude and latitude) and the dropoff time (paying time) due to some areas have a weaker signal and also when the possibility of the network being congested.

train %>%
  group_by(vendor_id, store_and_fwd_flag) %>%
  count()
## # A tibble: 3 x 3
## # Groups:   vendor_id, store_and_fwd_flag [3]
##   vendor_id store_and_fwd_flag      n
##   <fct>     <chr>               <int>
## 1 1         N                  670297
## 2 1         Y                    8045
## 3 2         N                  780302

As we can see above, Store and Forward appears only to vendor 1.
Does the flags Y appear with a higher frequency at a specific time of day?

train %>%
  mutate(hour_day = hour(dropoff_datetime)) %>%
  group_by(hour_day, store_and_fwd_flag) %>%
  count() %>% 
  ggplot(aes(hour_day, n, color = store_and_fwd_flag)) +
  geom_point(size = 4) +
  scale_y_log10() +
  labs(x = "Hour of the day",y = "No connection to the server occurrences") +
  ggtitle("Vendor_ID 1 - Store and Forward") +
  theme(plot.title = element_text(hjust = 0.5))

Flag Y follow the distribution frequency by hour of the Flag N, so we can’t determine exactly with how many occurences per hour the problem aggravates.

3.3.4 Store and Forward vs Dropoff Position

Let’s take a look in the position of the Store_and_fwd_flag = Y to try to find a correlation:

train %>%
  filter(store_and_fwd_flag == "Y") %>% 
  leaflet() %>%
  addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "blue", fillOpacity = 0.3)

Visually the map showed no relationship between a certain area and the Store_and_flag = Y.

3.4 Feature engineering

In this section we build new features from the existing ones, trying to find better predictors for our target variable.
The new temporal features (date,month,wday, hour) are derived from the pickup_datetime. We got the JFK and LA Guardia airport coordinates from Wikipedia. The blizzard feature is based on the external weather data.

jfk_coord <- tibble(lon = -73.778889, lat = 40.639722)
la_guardia_coord <- tibble(lon = -73.872611, lat = 40.77725)

pick_coord <- train %>% 
  select(pickup_longitude, pickup_latitude)

drop_coord <- train %>% 
  select(dropoff_longitude, dropoff_latitude)

train$dist <- distCosine(pick_coord, drop_coord)
train$bearing <- bearing(pick_coord, drop_coord)

train$jfk_dist_pick <- distCosine(pick_coord, jfk_coord)
train$jfk_dist_drop <- distCosine(drop_coord, jfk_coord)
train$lg_dist_pick <- distCosine(pick_coord, la_guardia_coord)
train$lg_dist_drop <- distCosine(drop_coord, la_guardia_coord)

train <- train %>% 
  mutate(speed = (dist/trip_duration)*3.6,
         date = date(pickup_datetime),
         month = month(pickup_datetime, label = TRUE),
         wday = wday(pickup_datetime, label = TRUE),
         wday = fct_relevel(wday,c("seg","ter", "qua", "qui", "sex", "sáb", "dom")),
         hour = hour(pickup_datetime),
         work = (hour %in% seq(8,18)) & (wday %in% c("seg","ter", "qua", "qui", "sex")),
         jfk_trip = (jfk_dist_pick < 2e3) | (jfk_dist_drop < 2e3),
         lg_trip = (lg_dist_pick < 2e3) | (lg_dist_drop < 2e3),
         blizzard = !((date < ymd("2016-01-22") | (date > ymd("2016-01-29"))))
  )

Let’s take a look in the created fields:

glimpse(train)
## Observations: 1,458,644
## Variables: 26
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <fct> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, ...
## $ dropoff_datetime   <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, ...
## $ passenger_count    <fct> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dist               <dbl> 1500.1995, 1807.5298, 6392.2513, 1487.1625,...
## $ bearing            <dbl> 99.932546, -117.063997, -159.608029, -172.7...
## $ jfk_dist_pick      <dbl> 22315.02, 20258.98, 21828.54, 21461.51, 236...
## $ jfk_dist_drop      <dbl> 21025.4008, 21220.9775, 20660.3899, 21068.1...
## $ lg_dist_pick       <dbl> 9292.897, 10058.778, 9092.997, 13228.107, 8...
## $ lg_dist_drop       <dbl> 7865.248, 11865.578, 13461.012, 14155.920, ...
## $ speed              <dbl> 11.869710, 9.814641, 10.834324, 12.479686, ...
## $ date               <date> 2016-03-14, 2016-06-12, 2016-01-19, 2016-0...
## $ month              <ord> mar, jun, jan, abr, mar, jan, jun, mai, mai...
## $ wday               <ord> seg, dom, ter, qua, sáb, sáb, sex, sáb, sex...
## $ hour               <int> 17, 0, 11, 19, 13, 22, 22, 7, 23, 21, 22, 1...
## $ work               <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FAL...
## $ jfk_trip           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ lg_trip            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ blizzard           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...

3.4.1 Direct distance of the trip

From the coordinates of the pickup and dropoff points we can calculate the direct distance (as the crow flies) between the two points, and compare it to our trip_durations. Since taxis aren’t crows (in most practical scenarios), these values correspond to the minimum possible travel distance.

To compute these distances we are using the distCosine function of the geosphere package for the spherical trigonometry. This method gives us the sortest distance between two points on a spheical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape. Here are the raw values of distance vs duration (based on a down-sized sample to speed up the kernel).

set.seed(4321)

train %>% 
  sample_n(5e4) %>% 
  ggplot(aes(dist,trip_duration)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Direct distance [m]", y = "Trip duration [s]")

We find:

  • The distance generally increases with increasing trip_duration (ooohh!! Really?! :]);
  • Here, the 24-hours trips look even more suspicious and are even more likely to be artefacts in the data;
  • In addition, there are number of trips with very short distances, down to 1 meter, but with a large range of apparent trip_durations.

Let’s filter the data a little bit to remove the extreme (and the extremely suspicious) data points, and bin the data indo a 2-d histogram.
This plot shows that in log-log space the trip_duration is increasing slower than linear for large distance values:

train %>% 
  filter(trip_duration < 3600 & trip_duration > 120) %>% 
  filter(dist > 100 & dist < 100e3) %>% 
  ggplot(aes(dist,trip_duration)) +
  geom_bin2d(bins = c(500,500)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Direct distance [m]", y = "Trip duration [s]")

3.4.2 Travel Speed

Distance over time is of course velocity, and by computing the average apparent velocity of our taxis we will have another diagnostic to remove bogus values. Of course, we won’t be able to use speed as a predictor for our model, since it requires knowing the travel time, but it can still be helpful in cleaning up our training data and findind other features with predictive power. This is the speed distribution:

train %>% 
  filter(speed > 2 & speed < 1e2) %>% 
  ggplot(aes(speed)) +
  geom_histogram(fill = "red", bins = 50) +
  labs(x = "Average speed [km/h] (direct distance)")

Well, after removing the most extreme values this looks way better than I would have expected. An average speed of around 15 km/h sounds probably reasonable for NYC. Everything above 50 km/h certainly requires magical cars (or highway travel). Also keep in mind that this refers to the direct distance and that the real velocity would have been always higher.
In a similar way as the average duration per day and hour we can also investigate the average speed for these time bins:

p1 <- train %>%
  group_by(vendor_id,wday) %>% 
  summarise(speed_median = median(speed)) %>% 
  ggplot(aes(wday,speed_median, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week", y = "Median speed [km/h]" )

p2 <- train %>%
  group_by(vendor_id,hour) %>% 
  summarise(speed_median = median(speed)) %>% 
  ggplot(aes(hour,speed_median, color = vendor_id)) +
  geom_smooth(method = "loess", span = 1/2) +
  geom_point(size = 4) +
  labs(x = "Hour of the day", y = "Median speed [km/h]" )
  theme(legend.position = "none")
## List of 1
##  $ legend.position: chr "none"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE
p3 <- train %>% 
  group_by(wday,hour) %>% 
  summarise(speed_median = median(speed)) %>% 
  ggplot(aes(hour,wday, fill = speed_median)) +
  geom_tile() +
  labs(x = "Hour of the day", y = "Day of the week") +
  scale_fill_distiller(palette = "Spectral")

layout <- matrix(c(1,2,3,3), 2,2, byrow = TRUE)
multiplot(p1,p2,p3, layout = layout)

We find:

  • Our taxis appear to be travelling faster on the weekend and on a Monday than during the rest of the week;
  • The early morning hours allow for a speedier trip, with everything from 8am to 6pm being similarly slow;
  • There are almost no differences between the two vendors;
  • The heatmap in the lower panel visualises how these trends combine to create a “low-speed-zone” in the middle of the day and week. Based on this, we create a new feature work, which we define as 8am-6pm on seg-sáb.

3.4.3 Airport distance

In our maps (above) and trip paths (below) we noticed that a number of trips began or ended at either of the two NYC airports: JFK and La Guardia. Since airports are usually not in the city center, it’s reasonable to assume that the pickup/dropoff distance from the airport could be a useful predictor for longer trip_durations. Above, we defined the coordinates of the two airports and computed the corresponding distances:

p1 <- train %>%
  ggplot(aes(jfk_dist_pick)) +
  geom_histogram(fill = "red", bins = 30) +
  scale_x_log10() +
  scale_y_sqrt() +
  geom_vline(xintercept = 2e3) +
  labs(x = "JFK pickup distance", y = "count" )

p2 <- train %>%
  ggplot(aes(lg_dist_pick)) +
  geom_histogram(fill = "red", bins = 30) +
  scale_x_log10() +
  scale_y_sqrt() +
  geom_vline(xintercept = 2e3) +
  labs(x = "La Guardia pickup distance", y = "count" )
  
p3 <- train %>%
  ggplot(aes(jfk_dist_drop)) +
  geom_histogram(fill = "blue", bins = 30) +
  scale_x_log10() +
  scale_y_sqrt() +
  geom_vline(xintercept = 2e3) +
  labs(x = "JFK dropoff distance", y = "count" )

p4 <- train %>%
  ggplot(aes(lg_dist_drop)) +
  geom_histogram(fill = "blue", bins = 30) +
  scale_x_log10() +
  scale_y_sqrt() +
  geom_vline(xintercept = 2e3) +
  labs(x = "La Guardia pickup distance", y = "count" )

layout <- matrix(c(1,2,3,4), 2,2, byrow = TRUE)
multiplot(p1,p2,p3,p4,layout = layout)

Based on these numbers, we can define a JFK/La Guardia trip as having a pickup or dropoff distance of less than 2 km from the corresponding airport.

What are the trip_durations of these journeys?

trip_durations. Above, we defined the coordinates of the two airports and computed the corresponding distances:

p1 <- train %>%
  filter(trip_duration < 23*3600) %>% 
  ggplot(aes(jfk_trip, trip_duration, color = jfk_trip)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "JFK trip", y = "Trip Duration" ) +
  theme(legend.position = "none")

p2 <- train %>%
  filter(trip_duration < 23*3600) %>% 
  ggplot(aes(lg_trip, trip_duration, color = lg_trip)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "La Guardia trip", y = "Trip Duration" ) +
  theme(legend.position = "none")
  

layout <- matrix(c(1,2), 1,2, byrow = TRUE)
multiplot(p1,p2,layout = layout)

We find that our hypothesis was correct and that trips to the airport, in particular the more distant JFK, have significantly longer average trip_durations. These two features should definitely be part of our model.

3.5 Data cleaning

Before we turn to the modelling it is time to clean up our training data. We have waited to do this until now to have a more complete overview of the problematic observations. The aim here is to remove trips that have improbable features, such as extreme trip duration.
While there might also be a number of bogus trip durations in the test data we shouldn’t be able to predict them in any case (unless there were some real correlations). By removing these training data values we will make our model more robust and more likely to generalise to unseen data, which is always our primary goal in machine learning.

3.5.1 Extreme trip durations

Let’s visualise the distances of the trips that took a day or longer. Unless someone took a taxi from NYC to LA, it’s unlikely that those values are accurate. Here we make use of the maps package to draw an outline of Manhattan, where most of the trips begin or end. We then overlay the pickyp coordinates in red, and the dropoff coordinates in blue.
To further add the trip connections (direct distance) we use another handy tool from the geosphere package; GCiNTERMEDIANTE allow us to interpolate the path between two sets of coordinates.

3.5.1.1 Longer than a day

We start with the few trips that pretend to have taken several days to complete:

day_plus_trips <- train %>%
  filter(trip_duration > 24*3600)

day_plus_trips %>% select(pickup_datetime, dropoff_datetime, speed)
## # A tibble: 4 x 3
##   pickup_datetime     dropoff_datetime      speed
##   <dttm>              <dttm>                <dbl>
## 1 2016-01-05 00:19:42 2016-01-27 11:08:38 0.0374 
## 2 2016-02-13 22:38:00 2016-03-08 15:57:38 0.0105 
## 3 2016-01-05 06:14:15 2016-01-31 01:01:07 0.00265
## 4 2016-02-13 22:46:52 2016-03-25 18:18:14 0.0203
ny_map <- as.tibble(map_data("state", region = "new york:manhattan"))

tpick <- day_plus_trips %>% 
  select(lon = pickup_longitude, lat = pickup_latitude)

tdrop <- day_plus_trips %>% 
  select(lon = dropoff_longitude, lat = dropoff_latitude)

p1 <- ggplot() +
  geom_polygon(data=ny_map, aes(x=long, y=lat), fill = "grey60") +
  geom_point(data=tpick, aes(x=lon, y=lat), size=1, color = 'red', alpha = 1) +
  geom_point(data=tdrop, aes(x=lon, y=lat), size =1, color = 'blue', alpha = 1)

for (i in seq(1,nrow(tpick))) {
  inter <- as.tibble(gcIntermediate(tpick[i,], tdrop[i,], n=30, addStartEnd = TRUE))
  p1 <- p1 + geom_line(data=inter,aes(x=lon, y=lat), color = 'blue', alpha = .75)
}

p1 + ggtitle("Longer than a day trips in relation to Manhattan")

We find nothing out of the ordinary here. While a trip to JFK can seem like an eternity if your flight is boarding soon, it’s unlikely to take this long in real time. The average taxi speeds don’t look very likely either.

Decision: These values should be removed from the training data set for continued exploration and modelling.

3.5.1.2 Close to 24 hours

The author don’t think (so do I) it’s inconceivable that someone takes a taxi for a trip that lasts almost a day (with breaks, of course). In very rare occasions this might happen; provided, of course, that the distance travelled was sufficiently long.
Here we define day-long trips as taking between 22 and 24 hours, which covers a small peak in our raw trip_duration distribution. Those are the top 5 direct distances (in m) among the day-long trips:

day_trips <- train %>% 
  filter(trip_duration < 24*3600 & trip_duration > 22*3600)

day_trips %>% 
  arrange(desc(dist)) %>% 
  select(dist, pickup_datetime, dropoff_datetime,speed) %>% 
  head(5)
## # A tibble: 5 x 4
##     dist pickup_datetime     dropoff_datetime    speed
##    <dbl> <dttm>              <dttm>              <dbl>
## 1 60666. 2016-06-04 13:54:29 2016-06-05 13:40:30 2.55 
## 2 42401. 2016-01-28 21:43:02 2016-01-29 21:33:30 1.78 
## 3 28116. 2016-03-14 22:46:05 2016-03-15 22:24:27 1.19 
## 4 22808. 2016-06-18 17:41:47 2016-06-19 16:30:41 1.000
## 5 22759. 2016-06-01 19:52:42 2016-06-02 19:35:09 0.960

The top one is about 60 km, which is not particulary far. The average speed wouldn’t suggest a generous tip, either. What do these trips look like on the map?

set.seed(2017)

day_trips <- day_trips %>% 
  sample_n(200)

tpick <- day_trips %>% 
  select(lon = pickup_longitude, lat = pickup_latitude)

tdrop <- day_trips %>% 
  select(lon = dropoff_longitude, lat = dropoff_latitude)

p1 <- ggplot() +
  geom_polygon(data=ny_map, aes(x=long, y=lat), fill = "grey60") +
  geom_point(data=tpick, aes(x=lon, y=lat), size=1, color = 'red', alpha = 1) +
  geom_point(data=tdrop, aes(x=lon, y=lat), size =1, color = 'blue', alpha = 1)

for (i in seq(1,nrow(tpick))) {
  inter <- as.tibble(gcIntermediate(tpick[i,], tdrop[i,], n=30, addStartEnd = TRUE))
  p1 <- p1 + geom_line(data=inter,aes(x=lon, y=lat), color = 'blue', alpha = .255)
}

p1 + ggtitle("Day-Long trips in relation to Manhattan")

Here we are plotting only 200 of the about 1800 connections to keep the map reasonably readable and the script fast. Pickup points are red and dropoff points are blue.

We find:

  • A few longer distances stand out, but they are exceptions. The two major group of trips are those within Manhattan and those between Manhattan and the airports;
  • There is little to suggest that these extreme trip_durations were real;
  • There is another insight here which is rather intuitive: trips to or from any of the airports (most prominently JFK) are unlikely to be very short. Thus, there are a close distance of either pickup or dropoff to the airport could be a valuable predictor for longer trip_diration.. This is something that we took from here to the feature engineering.

Decision: We will remove _trip_durations longe than 22 hours from the exploration and possibly from the modelling.

3.5.1.3 Shorter than a few minutes

On the other side of the trip_duration distribution we have those rides that appear to only have lasted for a couple of minutes. While such short trips are entirely possible, let’s check their durations and speeds to make sure that they are realistic.

min_trips <- train %>% 
  filter(trip_duration < 5*60)

min_trips %>% 
  arrange(dist) %>% 
  select(dist, pickup_datetime, dropoff_datetime,speed) %>% 
  head(5)
## # A tibble: 5 x 4
##    dist pickup_datetime     dropoff_datetime    speed
##   <dbl> <dttm>              <dttm>              <dbl>
## 1    0. 2016-02-29 18:39:12 2016-02-29 18:42:59    0.
## 2    0. 2016-01-27 22:29:31 2016-01-27 22:29:58    0.
## 3    0. 2016-01-22 16:13:01 2016-01-22 16:13:20    0.
## 4    0. 2016-01-18 15:24:43 2016-01-18 15:28:57    0.
## 5    0. 2016-05-04 22:28:43 2016-05-04 22:32:51    0.
3.5.1.3.1 Zero-distance trips

In doing so, we notice that there are a relatively large number of zero-distance trips:

zero_dist <- train %>%
  filter(near(dist,0))

nrow(zero_dist)
## [1] 5897

What are their nominal top durations?

zero_dist %>% 
  arrange(desc(trip_duration)) %>% 
  select(trip_duration, pickup_datetime, dropoff_datetime,vendor_id) %>% 
  head(5)
## # A tibble: 5 x 4
##   trip_duration pickup_datetime     dropoff_datetime    vendor_id
##           <int> <dttm>              <dttm>              <fct>    
## 1         86352 2016-06-05 01:09:39 2016-06-06 01:08:51 2        
## 2         85333 2016-01-01 15:27:28 2016-01-02 15:09:41 2        
## 3         78288 2016-05-18 13:40:45 2016-05-19 11:25:33 2        
## 4          5929 2016-06-08 16:47:44 2016-06-08 18:26:33 2        
## 5          4683 2016-05-25 17:36:49 2016-05-25 18:54:52 2

There really are a few taxis where the data wants to tell us they have not moved at all for about a day. While carrying a passenger. We choose not to believe the data in this case.
Once we remove the extreme cases, this is what the distribution looks like:

zero_dist %>% 
  filter(trip_duration < 6000) %>% 
  ggplot(aes(trip_duration, fill = vendor_id)) +
  geom_histogram(bins = 50) +
  scale_x_log10()

We find:

  • trip_durations of about a minute might still be somehow possible, assuming that someone got into a taxi but then changed their mind before the taxi could move. Whether that should count as a “trip” is a different question. But trip durations of about 15 minutes (900s) without any distance covered seem hardly possible. Unless they involve traffic jams, but who get’s into a taxi that’s in a traffic jam?
  • It is also noteworthy that most trips in the less-than-a-minute-group were from vendor 1.

Decision: we will remove those zero-distance trips that took more than a minute for our continued analysis. Removing them from the modelling might be detrimental if there are similar trips in the test sample.

3.5.1.3.2 Short trips with above zero distance

After removing the zero-distance trips, those are the short rides with the highest average speed:

min_trips <- train %>%
  filter(trip_duration < 5*60 & dist > 0)

min_trips %>% 
  arrange(desc(speed)) %>% 
  select(trip_duration, dist, pickup_datetime,speed) %>% 
  head(10)
## # A tibble: 10 x 4
##    trip_duration    dist pickup_datetime     speed
##            <int>   <dbl> <dttm>              <dbl>
##  1             7  18055. 2016-02-13 20:28:30 9285.
##  2           282 320484. 2016-04-02 20:33:19 4091.
##  3             2    784. 2016-06-12 06:35:13 1412.
##  4            51  19970. 2016-06-19 20:18:35 1410.
##  5           279 104877. 2016-03-20 21:07:56 1353.
##  6             2    704. 2016-06-23 13:36:48 1267.
##  7            20   6919. 2016-05-28 15:14:19 1245.
##  8             3    914. 2016-05-30 17:12:12 1097.
##  9             4   1031. 2016-01-17 03:11:56  928.
## 10             3    762. 2016-06-13 16:34:30  914.

Clearly, the top speed values are impossible. We include a map plot of the general distribution of distances within the sample:

set.seed(1234)

foo <- min_trips %>% 
  sample_n(600)

tpick <- foo %>% 
  select(lon = pickup_longitude, lat = pickup_latitude)

tdrop <- foo %>%
  select (lon = dropoff_longitude, lat = dropoff_latitude)

p1 <- ggplot () +
  geom_polygon(data = ny_map, aes(x=long, y=lat), fill = "grey60") +
  geom_point(data=tpick, aes(x=lon, y=lat), size = 1, color = 'red', alpha = 1) +
  geom_point(data=tdrop, aes(x=lon,y=lat), size = 1, color = 'blue', alpha = 1)

for (i in seq(1, nrow(tpick))) {
  inter <- as.tibble(gcIntermediate(tpick[i,], tdrop[i,], n = 30, addStartEnd = TRUE))
  p1 <- p1 + geom_line(data=inter, aes(x=lon,y=lat), color = 'blue', alpha=.25)
}

p1 + ggtitle("Minute-long trips in relation to Manhattan")

We find (from the hidden plot):

  • Most distances are in fact short, which means that combined with setting an average speed limit we should be able to remove those values that are way beyond being realistic. This should also get rid of many trips that appear to have durations of seconds only.

Decision: we impose a lower trip_duration limit of 10 seconds and a (very conservative) speed limit of 100 km/h. (Remember that this refers to the direct distance).

3.5.2 Intermission - The best spurious trips

Every data set has a few entries that are just flat out ridiculous. Here are the best ones from this one, with pickup or dropoff locations more than 300 km from NYC (JFK airport):

long_dist <- train %>% 
  filter( (jfk_dist_pick > 3e5) | (jfk_dist_drop > 3e5) )

long_dist_coord <- long_dist %>% 
  select(lon = pickup_longitude, lat = pickup_latitude)

long_dist %>% 
  select(id, jfk_dist_pick, jfk_dist_drop, dist, trip_duration, speed) %>% 
  arrange(desc(jfk_dist_pick))
## # A tibble: 31 x 6
##    id        jfk_dist_pick jfk_dist_drop      dist trip_duration     speed
##    <chr>             <dbl>         <dbl>     <dbl>         <int>     <dbl>
##  1 id2854272      4128727.      4128719.      14.8           499    0.107 
##  2 id3777240      4128721.      4128726.      21.8          1105    0.0711
##  3 id2306955      1253574.        21484. 1242299.            792 5647.    
##  4 id1974018      1115646.      1115646.       0.            369    0.    
##  5 id0267429       988748.       988748.       0.            961    0.    
##  6 id0838705       695767.       481141.  215468.           1131  686.    
##  7 id0205460       684133.       684133.       0.            329    0.    
##  8 id0978162       674251.       941618.  315117.            875 1296.    
##  9 id3525158       665877.       665877.       0.            385    0.    
## 10 id1510552       642663.       472020.  892212.            611 5257.    
## # ... with 21 more rows

Many zero-distance trips with more than a minute duration ,which we would remove anyway. But just out of curiousity, where did they happen? (We will again use the amazing leaflet package, this time with individual markers that give us id and direct distance information for mouse-over and click actions).

leaflet(long_dist_coord) %>% 
  addTiles() %>% 
  setView(-92.00, 41.0, zoom = 4) %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addMarkers(popup = ~as.character(long_dist$dist), label = ~as.character(long_dist$id))

Watch out!! Russian submarines sighted!!!!
9 positions are actually ocean-going taxis. Two are near to San Francisco. And then most of others will be removed by our previous cleaning limits.
These long-distance locations represent outlisers that should be removed to improve the robustness of predictive models.

3.5.3 Final cleaning

Here we apply the cleaning filter that are discussed above.
the amazing leaflet package, this time with individual markers that give us id and direct distance information for mouse-over and click actions).

train <- train %>% 
  filter(trip_duration < 22*3600,
        dist > 0 | (near(dist,0) & trip_duration < 60),
        jfk_dist_pick < 3e5 & jfk_dist_drop < 3e5,
        trip_duration > 10,
        speed < 100)

3.6 Kmeans - Getting centers for dropoff/pickup coordinates

Now that we have cleaned our data, let’s extract the kmeans and cluster our data with these values.

_“K-Means is one of the most popular”clustering" algorithms. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster’s centroid than any other centroid. K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids (2) chosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.“_
For more info: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

3.6.1 Finding the best number of clusters

3.6.1.1 Pickup position

library("plotly")

foo <- sample_n(train, 8e3)
quality_between_per_totss <- data.frame(
  betweenss_per_totss = as.numeric(),
  clusters = as.numeric(),
  stringsAsFactors = FALSE
  )

quality_between_per_totss_vector <- c(
  betweenss_per_totss = as.numeric(),
  clusters = as.numeric()
)

for (cluster in 1:30) {
  kmeans_pickup <- kmeans(foo[,c("pickup_longitude", "pickup_latitude")], centers = cluster,
                          algorithm = "Lloyd", iter.max = 500)
  quality_between_per_totss_vector["betweenss_per_totss"] <- kmeans_pickup$betweenss/kmeans_pickup$totss
  quality_between_per_totss_vector["clusters"] <- length(levels(as.factor(kmeans_pickup$cluster)))
  quality_between_per_totss <- bind_rows(quality_between_per_totss,quality_between_per_totss_vector)
}
plot_ly(quality_between_per_totss,
        x = quality_between_per_totss$clusters,
        y = quality_between_per_totss$betweenss_per_totss) %>%
        layout(
          xaxis = list(title = "Number of clusters"),
          yaxis = list(title = "Betweens/totss")
          )

3.6.1.2 Dropoff position

quality_between_per_totss <- data.frame(
  betweenss_per_totss = as.numeric(),
  clusters = as.numeric(),
  stringsAsFactors = FALSE
  )

quality_between_per_totss_vector <- c(
  betweenss_per_totss = as.numeric(),
  clusters = as.numeric()
)

for (cluster in 1:30) {
  kmeans_dropoff <- kmeans(foo[,c("dropoff_longitude", "dropoff_latitude")], centers = cluster,
                          algorithm = "Lloyd", iter.max = 1000)
  quality_between_per_totss_vector["betweenss_per_totss"] <- kmeans_dropoff$betweenss/kmeans_dropoff$totss
  quality_between_per_totss_vector["clusters"] <- length(levels(as.factor(kmeans_dropoff$cluster)))
  quality_between_per_totss <- bind_rows(quality_between_per_totss,quality_between_per_totss_vector)
}
plot_ly(quality_between_per_totss,
        x = quality_between_per_totss$clusters,
        y = quality_between_per_totss$betweenss_per_totss) %>%
        layout(
          xaxis = list(title = "Number of clusters"),
          yaxis = list(title = "Betweens/totss")
          )

So for this situation, number of clusters = 15 seems the best choice due to its position in the beginning of the stabilization of the curve.

3.6.2 Find the centers of the K-Means Model

icons_kpickup <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = "blue"
)

icons_kdropoff <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = "orange"
)

kmeans_pickup <- kmeans(train[,c("pickup_longitude", "pickup_latitude")], centers = 15,
                 algorithm = "Lloyd", iter.max = 1000)
kcenters_pickup <- as.data.frame(kmeans_pickup$centers)
kmeans_dropoff <- kmeans(train[,c("dropoff_longitude", "dropoff_latitude")], centers = 15,
                         algorithm = "Lloyd", iter.max = 1000)
kcenters_dropoff <- as.data.frame(kmeans_dropoff$centers)

leaflet(data = train) %>% addProviderTiles(providers$Esri.WorldTopoMap) %>%
  addAwesomeMarkers(lng = kcenters_pickup$pickup_longitude , lat = kcenters_pickup$pickup_latitude,
                    icon = icons_kpickup, label = "Pickup") %>%
  addAwesomeMarkers(lng = kcenters_dropoff$dropoff_longitude , lat = kcenters_dropoff$dropoff_latitude,
                    icon = icons_kdropoff, label = "Dropoff")

3.6.3 Clustering by K-Centers

train <- train %>%
          mutate(KCluster_pickup = as.factor(kmeans_pickup$cluster),
                 KCluster_dropoff = as.factor(kmeans_dropoff$cluster))

glimpse(train)
## Observations: 1,450,263
## Variables: 28
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <fct> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, ...
## $ dropoff_datetime   <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, ...
## $ passenger_count    <fct> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dist               <dbl> 1500.1995, 1807.5298, 6392.2513, 1487.1625,...
## $ bearing            <dbl> 99.932546, -117.063997, -159.608029, -172.7...
## $ jfk_dist_pick      <dbl> 22315.02, 20258.98, 21828.54, 21461.51, 236...
## $ jfk_dist_drop      <dbl> 21025.4008, 21220.9775, 20660.3899, 21068.1...
## $ lg_dist_pick       <dbl> 9292.897, 10058.778, 9092.997, 13228.107, 8...
## $ lg_dist_drop       <dbl> 7865.248, 11865.578, 13461.012, 14155.920, ...
## $ speed              <dbl> 11.869710, 9.814641, 10.834324, 12.479686, ...
## $ date               <date> 2016-03-14, 2016-06-12, 2016-01-19, 2016-0...
## $ month              <ord> mar, jun, jan, abr, mar, jan, jun, mai, mai...
## $ wday               <ord> seg, dom, ter, qua, sáb, sáb, sex, sáb, sex...
## $ hour               <int> 17, 0, 11, 19, 13, 22, 22, 7, 23, 21, 22, 1...
## $ work               <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FAL...
## $ jfk_trip           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ lg_trip            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ blizzard           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ KCluster_pickup    <fct> 5, 4, 5, 10, 15, 4, 3, 15, 1, 13, 5, 12, 3,...
## $ KCluster_dropoff   <fct> 5, 7, 6, 6, 2, 3, 1, 1, 7, 2, 7, 1, 15, 5, ...

3.7 Conclusion

The analysis of this kernel ends with KMeans because this was the last subject studied at class.

In the near future I intend to continue the study of this Kernel and to test it with some Machine Learning Model.

Stay tuned =]